Abstract

We briefly review the inside-outside and EM algorithm for probabilistic context-free grammars. As a 
result, we formally prove that inside-outside estimation is a dynamic-programming variant of EM. This 
is interesting in its own right, but even more when considered in a theoretical context since the well-known
convergence behavior of inside-outside estimation has been confirmed by many experiments but
apparently has never been formally proved. However, being a version of EM, inside-outside estimation 
also inherits the good convergence behavior of EM. Therefore, the as yet imperfect line of argumentation 
can be transformed into a coherent proof. 

1 Inside-Outside Estimation 

The modern inside-outside algorithm was introduced by 0] who reviewed an algorithm proposed 
by ^ and extended it to an iterative training method for probabilistic context-free grammars enabling 
the use of unrestricted free text. In the following, $y_1, \ldots, y_N$ are numbered (but unannotated) sentences.

Definition: Inside-outside re-estimation formulas for probabilistic context-free grammars in Chomsky
normal form are given by (see [3], but see also ^ for the special case $N = 1$):

$$\hat{p}(A \to \alpha) := \frac{\sum_{i=1}^{N} C_{y_i}(A \to \alpha)}{\sum_{i=1}^{N} C_{y_i}(A)}$$

The key variables of this definition are so-called category and rule counts:

$$C_w(A) := \frac{1}{P} \sum_{s=1}^{n} \sum_{t=s}^{n} e(s,t,A) \cdot f(s,t,A),$$

$$C_w(A \to a) := \frac{1}{P} \sum_{1 \le t \le n,\; w_t = a} e(t,t,A) \cdot f(t,t,A),$$

$$C_w(A \to B\,C) := \frac{1}{P} \sum_{s=1}^{n-1} \sum_{t=s+1}^{n} \sum_{r=s}^{t-1} p(A \to B\,C) \cdot e(s,r,B) \cdot e(r+1,t,C) \cdot f(s,t,A),$$

which are computed for each sentence $w := w_1 \ldots w_n$ with so-called inside and outside
probabilities. An inside probability is defined as the probability of category $A$ generating
observations $w_s \ldots w_t$, i.e. $e(s,t,A) := p(A \Rightarrow^* w_s \ldots w_t)$. In determining a
recursive procedure for calculating $e$, two cases must be considered:

• ($s = t$): Only one observation is emitted and therefore a rule of the form $A \to w_s$ applies:
$e(s,s,A) = p(A \to w_s)$, if $(A \to w_s) \in G$ (and 0, otherwise).

• ($s < t$): In this case we know that rules of the form $A \to B\,C$ must apply since more
than one observation is involved. Thus, $e(s,t,A)$ can be expressed as follows:
$e(s,t,A) = \sum_{(A \to B C) \in G} \sum_{r=s}^{t-1} p(A \to B\,C) \cdot e(s,r,B) \cdot e(r+1,t,C)$.
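The two cases of the recursion can be sketched in a few lines of Python. This is only a minimal illustration under assumptions that are not part of the text: the function name `inside`, the dictionary encoding of the grammar (`lexical` for rules $A \to a$, `binary` for rules $A \to B\,C$) and the toy grammar are all hypothetical.

```python
from collections import defaultdict

def inside(words, lexical, binary):
    """Return e[(s, t, A)] = p(A =>* w_s ... w_t), with 1-based inclusive spans."""
    n = len(words)
    e = defaultdict(float)
    # Case s = t: only a lexical rule A -> w_s can apply.
    for s in range(1, n + 1):
        for (A, a), prob in lexical.items():
            if a == words[s - 1]:
                e[(s, s, A)] += prob
    # Case s < t: spans of length 2, 3, ..., n, split after position r.
    for length in range(2, n + 1):
        for s in range(1, n - length + 2):
            t = s + length - 1
            for (A, B, C), prob in binary.items():
                for r in range(s, t):
                    e[(s, t, A)] += prob * e[(s, r, B)] * e[(r + 1, t, C)]
    return e

# Hypothetical toy grammar: S -> S S (0.4), S -> a (0.6).
lexical = {("S", "a"): 0.6}
binary = {("S", "S", "S"): 0.4}
e = inside(["a", "a", "a"], lexical, binary)
sentence_prob = e[(1, 3, "S")]  # sum over the two parse trees of "a a a"
```

For this toy grammar the sentence "a a a" has two parse trees, each with probability $0.4^2 \cdot 0.6^3$, so the value stored for the full span is their sum.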

The quantity $e$ can therefore be computed recursively by determining $e$ for all sequences of length
1, then 2, and so on. The sentence probability $P := p(S \Rightarrow^* w)$ is a special inside
probability, namely $P = e(1,n,S)$. The outside probabilities are defined as follows:
$f(s,t,A) = p(S \Rightarrow^* w_1 \ldots w_{s-1} \, A \, w_{t+1} \ldots w_n)$.
The quantity $f(s,t,A)$ may be thought of as the probability that $A$ is generated in the re-write
process and that the strings not dominated by it are $w_1 \ldots w_{s-1}$ to the left and
$w_{t+1} \ldots w_n$ to the right. In this case, the non-terminal $A$ could be one of two possible
settings $C \to B\,A$ or $C \to A\,B$, hence:

$$f(s,t,A) = \sum_{B,C \in G} \left( \sum_{r=1}^{s-1} f(r,t,C) \cdot p(C \to B\,A) \cdot e(r,s-1,B) + \sum_{r=t+1}^{n} f(s,r,C) \cdot p(C \to A\,B) \cdot e(t+1,r,B) \right)$$

and

$$f(1,n,A) = \begin{cases} 1 & \text{if } A = S \\ 0 & \text{else.} \end{cases}$$

After the inside probabilities have been computed bottom-up, the outside probabilities can therefore
be computed top-down. Unfortunately, no convergence proofs of inside-outside estimation were given by ^ and 0].
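The bottom-up and top-down passes, together with the category and rule counts of the definition, can be sketched as follows. This is a hedged illustration, not an implementation taken from the cited literature: the function names `inside`, `outside` and `counts`, the dictionary encoding of the grammar and the toy grammar are assumptions made for this example.

```python
from collections import defaultdict

def inside(words, lexical, binary):
    """e[(s, t, A)] = p(A =>* w_s ... w_t), computed bottom-up (1-based spans)."""
    n, e = len(words), defaultdict(float)
    for s in range(1, n + 1):
        for (A, a), prob in lexical.items():
            if a == words[s - 1]:
                e[(s, s, A)] += prob
    for length in range(2, n + 1):
        for s in range(1, n - length + 2):
            t = s + length - 1
            for (A, B, C), prob in binary.items():
                for r in range(s, t):
                    e[(s, t, A)] += prob * e[(s, r, B)] * e[(r + 1, t, C)]
    return e

def outside(words, binary, e, start):
    """f[(s, t, A)] = p(S =>* w_1..w_{s-1} A w_{t+1}..w_n), computed top-down."""
    n, f = len(words), defaultdict(float)
    f[(1, n, start)] = 1.0  # base case: f(1, n, A) = 1 if A = S, else 0
    for length in range(n - 1, 0, -1):  # longer (parent) spans are finished first
        for s in range(1, n - length + 2):
            t = s + length - 1
            for (parent, left, right), prob in binary.items():
                for r in range(1, s):  # (s, t) is the right child, sibling covers r..s-1
                    f[(s, t, right)] += f[(r, t, parent)] * prob * e[(r, s - 1, left)]
                for r in range(t + 1, n + 1):  # (s, t) is the left child, sibling covers t+1..r
                    f[(s, t, left)] += f[(s, r, parent)] * prob * e[(t + 1, r, right)]
    return f

def counts(words, lexical, binary, start="S"):
    """Rule counts C_w(r) and category counts C_w(A) for one sentence."""
    n = len(words)
    e = inside(words, lexical, binary)
    f = outside(words, binary, e, start)
    P = e[(1, n, start)]  # sentence probability
    rule_count, cat_count = defaultdict(float), defaultdict(float)
    for (A, a) in lexical:
        for t in range(1, n + 1):
            if words[t - 1] == a:
                rule_count[(A, a)] += e[(t, t, A)] * f[(t, t, A)] / P
    for (A, B, C), prob in binary.items():
        for s in range(1, n):
            for t in range(s + 1, n + 1):
                for r in range(s, t):
                    rule_count[(A, B, C)] += (
                        prob * e[(s, r, B)] * e[(r + 1, t, C)] * f[(s, t, A)] / P
                    )
    for A in {rule[0] for rule in list(lexical) + list(binary)}:
        for s in range(1, n + 1):
            for t in range(s, n + 1):
                cat_count[A] += e[(s, t, A)] * f[(s, t, A)] / P
    return rule_count, cat_count

# Hypothetical toy grammar: S -> S S (0.4), S -> a (0.6).
lexical = {("S", "a"): 0.6}
binary = {("S", "S", "S"): 0.4}
rule_count, cat_count = counts(["a", "a", "a"], lexical, binary)
# Re-estimation for a one-sentence corpus: rule count divided by category count.
p_hat = {rule: c / cat_count[rule[0]] for rule, c in rule_count.items()}
```

Every parse of "a a a" under this toy grammar contains three lexical and two binary rules, so the rule counts come out as 3 and 2 and the category count of S as 5, their sum.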

2 EM for Probabilistic Context-Free Grammars 

The EM algorithm was introduced by [3] as iterative maximum likelihood estimation for parameterized
probability models $p(y)$ using a sample $\tilde{p}(y)$ of incomplete data types $y$ which are defined
via a symbolic analyzer $X(y)$ dealing with complete data types $x$. It is known that EM
generalizes ordinary maximum likelihood estimation and monotonically increases the log-likelihood
$L(p) := \sum_y \tilde{p}(y) \cdot \log \sum_{x \in X(y)} p(x)$. Furthermore, the limit point of a
convergent parameter sequence is a stationary point (i.e. local minimum, saddle point or maximum)
of the log likelihood [3]. Moreover, both the parameter sequence and the associated sequence of
log likelihood values converge (in some cases to local maxima), if some weak conditions are fulfilled j^].

Applying EM to probabilistic context-free grammars, the grammatical sentences $y$ are viewed
as incomplete and their syntax trees $x$ as complete. The required symbolic analyzer is given by a
parser computing all trees $x \in T(y)$ for a sentence $y$. Via these non-probabilistic EM components,
the probability model for the sentences is defined as
$p(y) := \sum_{x \in T(y)} p(x) = \sum_{x \in T(y)} \prod_r p(r)^{f_r(x)}$, where
$f_r(x)$ is the frequency of rule $r$ occurring in $x$, and parameterization is given by rule
probabilities $p(r)$. The key variables of EM re-estimation are conditional expected frequencies
(relying on the conditional probability $p(x|y) := \frac{p(x)}{p(y)}$) for rules $r$ and categories $A$:
$p(.|y)[f_r] := \sum_{x \in T(y)} p(x|y) \cdot f_r(x)$ and
$p(.|y)[f_A] := \sum_{x \in T(y)} p(x|y) \cdot f_A(x)$, where $f_A(x) := \sum_{r \in G_A} f_r(x)$ is
the frequency of category $A$ occurring in $x$, and $G_A$ is the set of grammar rules with
left-hand side $A$. See e.g. I^:

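Lemma: EM re-estimation formulas for probabilistic context-free grammars are given by:

$$p_{EM}(r) := \frac{\sum_y \tilde{p}(y) \cdot p(.|y)[f_r]}{\sum_y \tilde{p}(y) \cdot p(.|y)[f_A]} \quad \text{for each rule } r \text{ with left-hand side } A.$$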
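To make the expected frequencies concrete, one EM iteration can be sketched by enumerating all syntax trees of each sentence explicitly, which is feasible only for tiny examples. The names `trees`, `expected_rule_frequencies` and `em_step`, the dictionary encoding of a grammar in Chomsky normal form and the toy grammar are assumptions made for this illustration.

```python
from collections import Counter, defaultdict

def trees(words, s, t, A, lexical, binary):
    """Yield (p(x), rule frequencies f_r(x)) for every tree x of A over w_s..w_t."""
    if s == t:
        prob = lexical.get((A, words[s - 1]))
        if prob:
            yield prob, Counter({(A, words[s - 1]): 1})
        return
    for (X, B, C), prob in binary.items():
        if X != A:
            continue
        for r in range(s, t):
            for p_left, f_left in trees(words, s, r, B, lexical, binary):
                for p_right, f_right in trees(words, r + 1, t, C, lexical, binary):
                    yield prob * p_left * p_right, f_left + f_right + Counter({(A, B, C): 1})

def expected_rule_frequencies(words, lexical, binary, start="S"):
    """p(.|y)[f_r] = sum over x in T(y) of p(x|y) * f_r(x) for one sentence y."""
    parses = list(trees(words, 1, len(words), start, lexical, binary))
    p_y = sum(p_x for p_x, _ in parses)  # p(y) = sum of tree probabilities
    expected = defaultdict(float)
    for p_x, freq in parses:
        for rule, k in freq.items():
            expected[rule] += (p_x / p_y) * k
    return expected

def em_step(corpus, lexical, binary, start="S"):
    """One EM iteration; summing over corpus tokens weights each y by f(y)."""
    rule_total, cat_total = defaultdict(float), defaultdict(float)
    for words in corpus:
        expected = expected_rule_frequencies(words, lexical, binary, start)
        for rule, value in expected.items():
            rule_total[rule] += value
            cat_total[rule[0]] += value  # f_A is the sum of f_r over rules with lhs A
    return {rule: value / cat_total[rule[0]] for rule, value in rule_total.items()}

# Hypothetical toy grammar: S -> S S (0.4), S -> a (0.6).
lexical = {("S", "a"): 0.6}
binary = {("S", "S", "S"): 0.4}
new_probs = em_step([["a", "a"], ["a", "a", "a"]], lexical, binary)
```

Explicit enumeration grows exponentially with sentence length, which is the practical reason for replacing it with a dynamic-programming computation of the same expectations.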

3 Inside-Outside as Dynamic EM 

In this section, the well-known convergence properties of the inside-outside algorithm, which have been 
unfortunately omitted in the original literature ^), will be formally proven. For this purpose, we 
will show that the inside-outside algorithm is a dynamic-programming variant of the EM algorithm 
for context-free grammars. This property is also well-known in stochastic linguistics, but to the best
of our knowledge none of the mentioned properties has been formally proven until now.

Theorem: For a context-free grammar in Chomsky normal form, let $\hat{p}(r)$ be re-estimated rule
probabilities resulting from one single step of the inside-outside algorithm using the current rule
probabilities $p(r)$. Then: (i) The log likelihood $L(.)$ of the training corpus increases monotonically,
i.e. $L(\hat{p}) \geq L(p)$. (ii) The limit points of a sequence of re-estimated probabilities are
stationary points (i.e. maxima, minima or saddle points) of the log likelihood function. (iii) The
inside-outside algorithm is a dynamic-programming variant of the EM algorithm, i.e. $\hat{p}(r)$
corresponds to $p_{EM}(r)$ resulting from one single EM iteration (using also $p(r)$ as current rule
probabilities).

Proof: (i) and (ii) follow using both (iii) and the convergence properties of EM. (iii): The empirical
distribution of the sentences is defined as $\tilde{p}(y) = \frac{f(y)}{N}$, where $f(y)$ is the
frequency of $y$ occurring in the corpus $y_1 \ldots y_N$. Thus, for each rule $r$ with left-hand side $A$:

$$p_{EM}(r) = \frac{\sum_y f(y) \cdot p(.|y)[f_r]}{\sum_y f(y) \cdot p(.|y)[f_A]}$$

Comparing these formulas with the re-estimation formulas presented by [4], it follows
$p_{EM}(r) = \hat{p}(r)$, if for each sentence $y$, for each rule $r$ and each category $A$ the
following propositions can be shown:

$$C_y(r) = \sum_{x \in T(y)} p(x|y) \cdot f_r(x) \quad \text{and} \quad C_y(A) = \sum_{x \in T(y)} p(x|y) \cdot f_A(x).$$

This is the goal of the rest of the proof, which we split into two lemmas. The first lemma is probably
due to [2], where corresponding formulas are used, but not explicitly proven, to present inside-outside 
estimation. The lemma says that category counts can be computed by summing certain rule counts. 

Lemma: $C_y(A) = \sum_{r \in G_A} C_y(r)$ for each sentence $y$ and each category $A$.

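Proof: Assuming Chomsky normal form, and $y = w_1 \ldots w_n$:

$$\begin{aligned}
\sum_{r \in G_A} C_y(r)
&= \sum_{a} C_y(A \to a) + \sum_{B,C \in G} C_y(A \to B\,C) \\
&= \frac{1}{P} \sum_{1 \le t \le n} e(t,t,A) \, f(t,t,A)
 + \frac{1}{P} \sum_{B,C \in G} \sum_{s=1}^{n-1} \sum_{t=s+1}^{n} \sum_{r=s}^{t-1} p(A \to B\,C) \, e(s,r,B) \, e(r+1,t,C) \, f(s,t,A) \\
&= \frac{1}{P} \left( \sum_{1 \le t \le n} e(t,t,A) \, f(t,t,A)
 + \sum_{s=1}^{n-1} \sum_{t=s+1}^{n} f(s,t,A) \sum_{B,C \in G} \sum_{r=s}^{t-1} p(A \to B\,C) \, e(s,r,B) \, e(r+1,t,C) \right) \\
&= \frac{1}{P} \left( \sum_{1 \le t \le n} e(t,t,A) \, f(t,t,A)
 + \sum_{s=1}^{n-1} \sum_{t=s+1}^{n} f(s,t,A) \, e(s,t,A) \right) \\
&= \frac{1}{P} \sum_{1 \le s \le t \le n} e(s,t,A) \, f(s,t,A) = C_y(A).
\end{aligned}$$

In the fourth equation, we used the recursion formula of the inside probabilities, q.e.d.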

It follows that the desired identities for the category counts can be calculated (by summation
over all rules with the same left-hand side) using the identities for the rule counts, since
$C_y(A) = \sum_{A \to \alpha} C_y(A \to \alpha)$ and, by definition,
$f_A(x) = \sum_{A \to \alpha} f_{A \to \alpha}(x)$. Thus, the proof of the theorem is
completed as soon as the following central lemma has been proven. It states that the counts of the
inside-outside algorithm can be identified with the expected rule frequencies of the EM algorithm.

Lemma: For each sentence $y$ and each rule $r$: $C_y(r) = \sum_{x \in T(y)} p(x|y) \cdot f_r(x) = p(.|y)[f_r]$.

Proof: The second equation is simply the definition of the expectation. Assuming Chomsky normal
form, two cases must be considered. First, the rule has the form $A \to B\,C$:

For a given sentence $y = w_1 \ldots w_n$ and given three spans $(s,r,B)$, $(r+1,t,C)$, $(s,t,A)$ with
$1 \le s \le r < t \le n$, let $X_{(s,t,A)(s,r,B)(r+1,t,C)}$ be the parse forest corresponding to the
following derivation:

$$S \Rightarrow^* w_1 \ldots w_{s-1} \, A \, w_{t+1} \ldots w_n \Rightarrow w_1 \ldots w_{s-1} \, B \, C \, w_{t+1} \ldots w_n \Rightarrow^* w_1 \ldots w_r \, C \, w_{t+1} \ldots w_n \Rightarrow^* w_1 \ldots w_n.$$

Let

$$f_{(s,t,A)(s,r,B)(r+1,t,C)}(x) := \begin{cases} 1 & \text{if } x \in X_{(s,t,A)(s,r,B)(r+1,t,C)} \\ 0 & \text{else} \end{cases}$$

be the characteristic function interpreting $X_{(s,t,A)(s,r,B)(r+1,t,C)}$ as a simple subset of the set
of all possible syntax trees $T(y)$ of the sentence $y$. Thus, the frequency $f_{A \to BC}(x)$ of the
rule $A \to B\,C$ occurring in the syntax tree $x \in T(y)$ can be computed as follows:

$$f_{A \to BC}(x) = \sum_{1 \le s \le r < t \le n} f_{(s,t,A)(s,r,B)(r+1,t,C)}(x).$$

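Using the linear properties of the expected frequencies $p(.|y)[\,.\,]$, it follows:

$$\begin{aligned}
p(.|y)[f_{A \to BC}]
&= p(.|y)\Big[ \sum_{1 \le s \le r < t \le n} f_{(s,t,A)(s,r,B)(r+1,t,C)} \Big] \\
&= \sum_{1 \le s \le r < t \le n} p(.|y)\big[ f_{(s,t,A)(s,r,B)(r+1,t,C)} \big] \\
&= \sum_{1 \le s \le r < t \le n} \; \sum_{x \in T(y)} p(x|y) \cdot f_{(s,t,A)(s,r,B)(r+1,t,C)}(x) \\
&= \frac{1}{p(y)} \sum_{1 \le s \le r < t \le n} \; \sum_{x \in T(y)} p(x) \cdot f_{(s,t,A)(s,r,B)(r+1,t,C)}(x) \\
&= \frac{1}{p(y)} \sum_{1 \le s \le r < t \le n} \; \sum_{x \in X_{(s,t,A)(s,r,B)(r+1,t,C)}} p(x) \\
&= \frac{1}{p(y)} \sum_{1 \le s \le r < t \le n} p\big( X_{(s,t,A)(s,r,B)(r+1,t,C)} \big) \\
&= \frac{1}{P} \sum_{1 \le s \le r < t \le n} f(s,t,A) \cdot p(A \to B\,C) \cdot e(s,r,B) \cdot e(r+1,t,C) \\
&= C_y(A \to B\,C).
\end{aligned}$$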



The second case, for rules of the form $A \to a$, follows analogously with spans $(s,s,A)$ and $(s,s,a)$.
Here, the details are omitted, but see [5]. q.e.d.
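For orientation, the omitted lexical case can be sketched along the same lines. The following is a reconstruction from the definitions of Section 1, not the derivation given in [5]; the forest $X_{(s,s,A)(s,s,a)}$ is assumed to collect all trees with derivation $S \Rightarrow^* w_1 \ldots w_{s-1} \, A \, w_{s+1} \ldots w_n \Rightarrow w_1 \ldots w_n$, where $w_s = a$.

```latex
\begin{aligned}
p(.|y)[f_{A \to a}]
&= \sum_{1 \le s \le n,\; w_s = a} p(.|y)\big[ f_{(s,s,A)(s,s,a)} \big] \\
&= \frac{1}{p(y)} \sum_{1 \le s \le n,\; w_s = a} p\big( X_{(s,s,A)(s,s,a)} \big) \\
&= \frac{1}{P} \sum_{1 \le s \le n,\; w_s = a} f(s,s,A) \cdot p(A \to a) \\
&= \frac{1}{P} \sum_{1 \le s \le n,\; w_s = a} f(s,s,A) \cdot e(s,s,A)
 = C_y(A \to a).
\end{aligned}
```

The last step uses the base case of the inside recursion, $e(s,s,A) = p(A \to w_s)$.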